The Application of NLTK Library for Python Natural Language Processing in Corpus Research

Abstract

Corpora play an important role in linguistics research and foreign language teaching. At present, corpus research in China mainly relies on WordSmith, AntConc, and other retrieval tools. The NLTK library, which is based on the Python language, can provide more flexible and richer methods, and it uses unified data standards to avoid the trouble of converting between various data types. At the same time, with the help of Python's numerous third-party libraries, it can make up for the shortcomings of those tools in syntax analysis, graphic rendering, regular expressions, and other aspects. Covering the main steps of corpus research, such as text cleaning, word form restoration (lemmatization), part-of-speech tagging, and statistics, this paper takes the US presidential inaugural addresses as an example to show how the tool can process language data, and introduces the application of the library in corpus research.
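The steps named in the abstract can be illustrated with a minimal sketch. It is not the paper's own code. It assumes NLTK is installed, and it uses only components that need no downloaded data: `RegexpTokenizer` for cleaning, `PorterStemmer` as a rough stand-in for word form restoration, and `FreqDist` for statistics. The `SAMPLE` text and the helper names `clean_and_tokenize` and `stem_tokens` are hypothetical and stand in for a real corpus file and a real pipeline.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Hypothetical sample text standing in for one corpus file (not from the paper).
SAMPLE = (
    "We the people renew our promise today. "
    "The people of this nation have promised to defend freedom, "
    "and freedom depends on the promises we keep."
)


def clean_and_tokenize(text):
    """Text cleaning: lowercase, then keep runs of ASCII letters only."""
    tokenizer = RegexpTokenizer(r"[a-z]+")
    return tokenizer.tokenize(text.lower())


def stem_tokens(tokens):
    """Reduce each token to a stem (a cruder substitute for lemmatization)."""
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]


tokens = clean_and_tokenize(SAMPLE)
stems = stem_tokens(tokens)

# Statistics: count how often each stem occurs.
freq = FreqDist(stems)
for stem, count in freq.most_common(5):
    print(stem, count)
```

With the relevant NLTK data packages downloaded, the same pipeline could read from `nltk.corpus.inaugural`, swap the stemmer for `WordNetLemmatizer`, and add `nltk.pos_tag` for part-of-speech tagging. Those steps are left out here so that the sketch runs without any downloads.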

Similar resources

NLTK: The Natural Language Toolkit

The Natural Language Toolkit is a suite of program modules, data sets, tutorials and exercises, covering symbolic and statistical natural language processing. NLTK is written in Python and distributed under the GPL open source license. Over the past three years, NLTK has become popular in teaching and research. We describe the toolkit and report on its current state of development.

Creating Appropriate Corpus for Information Retrieval and Natural Language Processing in Persian Language

Persian natural language processing (NLP) researchers have limited access to linguistic tools suitable for text processing. Therefore, research in Persian text processing is very limited. Since a dataset is an important requirement for experiments and their evaluation, we aimed to create appropriate corpora for information retrieval and natural language processing in Persian. th...

Web Text Corpus for Natural Language Processing

Web text has been successfully used as training data for many NLP applications. While most previous work accesses web text through search engine hit counts, we created a Web Corpus by downloading web pages to create a topic-diverse collection of 10 billion words of English. We show that for context-sensitive spelling correction the Web Corpus results are better than using a search engine. For t...

Corpus Design For Biomedical Natural Language Processing

This paper classifies six publicly available biomedical corpora according to various corpus design features and characteristics. We then present usage data for the six corpora. We show that corpora that are carefully annotated with respect to structural and linguistic characteristics and that are distributed in standard formats are more widely used than corpora that are not. These findings have...

Journal

Journal title: Theory and Practice in Language Studies

Year: 2021

ISSN: 1799-2591, 2053-0692

DOI: https://doi.org/10.17507/tpls.1109.09